Skip to content

fix: redesign attempt_number semantics#781

Merged
alexluong merged 5 commits intomainfrom
662-attempt-number-semantics
Mar 23, 2026
Merged

fix: redesign attempt_number semantics#781
alexluong merged 5 commits intomainfrom
662-attempt-number-semantics

Conversation

@alexluong
Copy link
Collaborator

@alexluong alexluong commented Mar 23, 2026

Summary

  • Manual retries derive attempt_number from logstore (was hardcoded 0)
  • Manual retry failures interact with the retry schedule: the scheduler has upsert semantics (same RetryID overwrites), so a failed manual retry atomically replaces any pending auto retry with the next backoff tier
  • Budget exhaustion on manual retry cancels any lingering scheduled retry

Test plan

  • Unit tests pass (internal/deliverymq, internal/apirouter, internal/models)
  • E2E tests pass (cmd/e2e)
  • New TestE2E_ManualRetryScheduleInteraction validates manual retry timing and scheduler logic

Closes #662

🤖 Generated with Claude Code

@vercel
Copy link

vercel bot commented Mar 23, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
outpost-docs Ready Ready Preview, Comment Mar 23, 2026 5:54pm
outpost-website Ready Ready Preview, Comment Mar 23, 2026 5:54pm

Request Review

@alexluong alexluong changed the title fix: redesign attempt_number semantics (#662) fix: redesign attempt_number semantics Mar 23, 2026
…ing (#662)

- Make attempt_number 1-indexed (was 0-indexed) across all delivery paths
- Manual retries derive attempt_number from logstore instead of hardcoding
- Manual retry failures now interact with retry schedule (cancel + reschedule)
- Budget exhaustion on manual retry cancels any lingering scheduled retry

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace call-recording assertions with stateful map that mirrors real
scheduler upsert/delete semantics. Tests now seed initial retry state
and assert the resulting state after handler execution. Use
ScheduledBackoff so each tier has a distinct delay, proving the
schedule advances to the correct tier.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Comment on lines +108 to +123
// 3. Derive attempt number from existing attempts
attemptResp, err := h.logStore.ListAttempt(c.Request.Context(), logstore.ListAttemptRequest{
TenantIDs: []string{event.TenantID},
EventIDs: []string{req.EventID},
DestinationIDs: []string{req.DestinationID},
Limit: 1,
SortOrder: "desc",
})
if err != nil {
AbortWithError(c, http.StatusInternalServerError, NewErrInternalServer(err))
return
}
attemptNumber := 1
if len(attemptResp.Data) > 0 {
attemptNumber = attemptResp.Data[0].Attempt.AttemptNumber + 1
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although I don't have a better option, this is potentially problematic. Data in the log store is not atomic and can take time to be retrievable. It's possible to end up with duplicates (attempt number) and the performance of the query is likely not very good on CH.

Perharps that's acceptable tradeoff but I wonder if there's a better approach we haven't considered yet

Copy link
Collaborator Author

@alexluong alexluong Mar 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm I don't see an alternative for deducing attempt_number based on available data. The only concern I have with the data is there's a 10-30s lag between the delivery and when the log data is persisted (batching). Otherwise, this query should be sufficient.

Perf is always something to consider because you can never retrieve a single item with CH but I think it's a tradeoff. Actually maybe we can use this attempt data and skip querying the Event, so it should theoretically be the same as before. It's also in the retry flow which I think is not as critical compared to the initial publish flow?

alexluong and others added 2 commits March 23, 2026 21:15
Retry executor now queries logstore for the latest attempt instead of
carrying a stale attempt number in the RSMQ message. Idempotency key
changed to event_id:destination_id:attempt_number so manual and auto
retries with the same attempt number are deduplicated (race protection).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…t number

ListAttempt already returns the associated Event on each AttemptRecord.
Replace the two-query pattern (RetrieveEvent + ListAttempt) with a
single ListAttempt call in both the retry executor and the API retry
handler. Remove RetrieveEvent from RetryEventGetter interface.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
EnsureLocalStack/EnsureRabbitMQ/EnsureGCP were called before
testinfra.Start(t) had a chance to skip, causing container startups
in CI even with -short. Add testutil.CheckIntegrationTest(t) at the
top of each test function so the skip runs first.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@alexluong alexluong merged commit 59d99df into main Mar 23, 2026
4 checks passed
@alexluong alexluong deleted the 662-attempt-number-semantics branch March 23, 2026 18:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Discussion: attempt_number semantics diverge from Hookdeck API

2 participants